Add per-instance cost cap to swe-bench runner#741
Conversation
Some evaluations have a small minority of instances that consume disproportionately large amounts of money. For example, the Gemini-3.5-Flash swe-bench-verified run on PR OpenHands/openhands-index-results#1167 spent $1912 total across 500 instances ($3.82 mean), but 22 instances cost >$10 each and accounted for ~20% of the total spend, with a worst-case of $44.24 for a single instance. Root cause for those 22 instances: they triggered the LLMSummarisingCondenser multiple times (4x for the worst case). Each condensation rewrites the prompt prefix and therefore invalidates the provider's prompt cache. Their cache-read ratio averaged 10% versus 27% for typical instances and 45% for cheap ones, so the bulk of their tokens were billed at the full uncached price. Combined with reasoning_effort=high (which adds reasoning tokens to every uncached call) and 300+ iterations, this multiplied out to ~$44 on the worst instance. This adds a defence-in-depth measure: a `--max-cost-per-instance` flag (also exposed as `EvalMetadata.max_cost_per_instance`, default None = disabled). When set, a small callback pauses the conversation once the per-instance accumulated_cost exceeds the cap, mirroring the existing behaviour of `max_iteration_per_run`. The patch produced up to that point is still collected and submitted. This does not fix the underlying condenser cache-invalidation issue (which would need SDK-level changes), but it does cap the blast radius for any single instance across all models. Wired into the swe-bench runner first since that is where the regression surfaced; can be plumbed into the other benchmark runners in a follow-up. Co-authored-by: openhands <openhands@all-hands.dev>
|
✅ Review complete. This review was performed through OpenHands Cloud Automation. You can log in and view the conversation here. |
all-hands-bot
left a comment
There was a problem hiding this comment.
Code Review Summary
🟡 Good taste — Well-motivated feature solving a real cost outlier problem. The implementation is clean and defensive.
What's Good
- Real problem, real data: The PR addresses a genuine issue (22 instances costing >$10 each, ~20% of total spend). The root cause analysis in the PR description is solid.
- Defence-in-depth approach: Correctly positioned as a last-resort cap rather than trying to fix the underlying cache/condenser issues at the SDK level.
- Defensive error handling: Both
pause()andget_combined_metrics()failures are caught and logged, preventing cascading failures. - Idempotent trigger: Once triggered, the callback skips subsequent calls — good for performance and correctness.
- Comprehensive tests: All error paths covered (metrics failure, pause failure, unbound callback, idempotency).
Minor Suggestions
-
benchmarks/swebench/run_infer.py, Line 304 Type annotation:
callbacks: listshould becallbacks: list[Callback]for clarity, matching the Conversation's callback type expectation. -
benchmarks/utils/cost_cap.py, Line 73 Type hint:
bind(self, conversation: object)usesobject— consider using a Protocol or BaseConversation type for better IDE support and documentation. -
benchmarks/utils/models.py, Line 97-98 Documentation overlap: The field description repeats information already in the module docstring. Consider trimming to avoid drift.
No Blocking Issues
The implementation is sound, well-tested, and solves the stated problem without introducing unnecessary complexity.
[RISK ASSESSMENT]
- [Overall PR]
⚠️ Risk Assessment: 🟢 LOW
The change is additive-only (new optional parameter with no default enforcement), wraps external failures gracefully, and includes comprehensive test coverage. No breaking changes to existing functionality.
VERDICT:
✅ Worth merging: Core logic is sound, defensive coding is exemplary, and the feature solves a real operational problem.
KEY INSIGHT:
The two-phase binding pattern (pass callback to Conversation constructor, then bind conversation reference afterward) is the correct approach to ensure the callback participates in the composed chain from the first event — well thought out.
This review was generated by an AI agent (OpenHands) on behalf of the user through OpenHands Automation. View conversation
Why
Triggered by review of OpenHands/openhands-index-results#1167 — the Gemini-3.5-Flash swe-bench-verified run spent $1,912 across 500 instances (mean $3.82), but 22 instances cost more than $10 each and accounted for ~20% of the total spend. The worst single instance cost $44.24.
Root cause for the 22 expensive instances
All 22 had the same fingerprint compared to typical runs:
The worst case (
django__django-16116, $44.24) fired theLLMSummarizingCondenser4 times during its 1069 events (at events 367, 552, 737, 922). Each condensation rewrites the prompt prefix and therefore invalidates the provider's prompt cache, so the bulk of subsequent prompt tokens are billed at the full uncached price (~10× the cached rate). Combined withreasoning_effort=high(which adds reasoning tokens to every uncached call) and 300+ iterations, this multiplied out to:20 of the 22 high-cost instances were resolved, so the agent was making progress — it just took too many iterations and burned too much money doing so.
What this PR does
Adds a small, opt-in defence-in-depth measure: a
--max-cost-per-instanceCLI flag (also exposed asEvalMetadata.max_cost_per_instance, defaultNone= disabled). When set, a callback pauses the conversation as soon as its accumulated cost exceeds the cap, mirroring the existing behaviour ofmax_iteration_per_run. The patch produced up to that point is still collected and submitted.Scope
benchmarks/swebench/run_infer.py) only, since that's where the regression surfaced.max_cost_per_instancefield lives on the sharedEvalMetadata, so plumbing it into the other benchmark runners is a one-line change per runner in a follow-up.What this does not fix
The underlying condenser cache-invalidation issue. Fixing that properly would need SDK-level changes (e.g. a condenser that keeps a stable cache prefix, or enforcement of
Metrics.max_budget_per_taskinside the run loop). Both are larger changes worth doing separately.Files
benchmarks/utils/cost_cap.py—CostCapCallbackclass with deferred binding (the callback needs a reference to the conversation, which can only be obtained after construction). Defensive error handling so a misbehaving metrics orpause()call can never take an instance down.tests/test_cost_cap.py— 9 unit tests using a fake conversation: rejects non-positive caps, no-op below cap, pauses at/above cap, idempotent once triggered, safe before binding, swallows metrics/pause failures.benchmarks/utils/models.py— addmax_cost_per_instance: float | None(gt=0).benchmarks/utils/args_parser.py— add--max-cost-per-instance.benchmarks/swebench/run_infer.py— construct the callback beforeConversation, bind it after.Test plan
pytest tests/test_cost_cap.py -v→ 9 passed.EvalMetadata(... max_cost_per_instance=0)is rejected by Pydantic and that the default value isNone.--max-cost-per-instance 7.5parses to7.5, absence parses toNone.Usage
This PR was created by an AI agent (OpenHands) on behalf of @juanmichelini, in response to a review comment on OpenHands/openhands-index-results#1167.
@juanmichelini can click here to continue refining the PR